INTRODUCTION

The primary goal of our effort is the development of robust and portable language processing
capabilities for information extraction applications. The system under evaluation here is based on language
processing components that have demonstrated strong performance capabilities in previous evaluations
[Lehnert et al. 1992a]. Having demonstrated the general viability of these techniques, we are now
concentrating on the practicality of our technology by creating trainable system components to replace
hand-coded data and manually engineered software.

Our general strategy is to automate the construction of domain-specific dictionaries and other
language-related resources so that information extraction can be customized for specific applications with a
minimal amount of human assistance. We employ a hybrid system architecture that combines selective concept
extraction [Lehnert 1991] technologies developed at UMass with trainable classifier technologies developed
at Hughes [Dolan et al. 1991]. Our MUC-5 system incorporates seven trainable language components to
handle (1) lexical recognition and part-of-speech tagging, (2) knowledge of semantic/syntactic interactions,
(3) semantic feature tagging, (4) noun phrase analysis, (5) limited coreference resolution, (6) domain object
recognition, and (7) relational link recognition. Our trainable components have been developed so that domain
experts who have no background in natural language or machine learning can train individual system
components in the space of a few hours.

Many critical aspects of a complete information extraction system are not appropriate for customization or
trainable knowledge acquisition. For example, our system uses low-level text specialists designed to
recognize dates, locations, revenue objects, and other common constructions that involve knowledge of
conventional language. Resources of this type are portable across domains (although not all domains require
all specialists) and should be developed as sharable language resources. The UMass/Hughes focus has been
on other aspects of information extraction that can benefit from corpus-based knowledge acquisition. For
example, in any given information extraction application, some sentences are more important than others,
and within a single sentence some phrases are more important than others. When a dictionary is customized
for a specific application, vocabulary coverage can be sensitive to the fact that many words contribute
little or no information to the final extraction task: full dictionary coverage is not needed for information
extraction applications.

In this paper we give an overview of our hybrid architecture and trainable system components. We look
at examples taken from our official test runs, discuss the test results obtained in our official and optional
test runs, and identify promising opportunities for additional research.


¹ This work was supported by the Department of Defense under Contract No. MDA904-92-C-2390. The
views expressed in this paper are those of the authors and should not be interpreted to represent opinions or
policies of the United States Government.
TRAINABLE LANGUAGE PROCESSING

Our MUC-5 system relies on two major tools that support automated dictionary construction: (1) OTB,
a trainable part-of-speech tagger, and (2) AutoSlog, a dictionary construction tool that operates in
conjunction with the CIRCUS sentence analyzer. We trained OTB for EJV on a subset of EJV texts and
then again for EME using only EME texts. OTB is notable for the high hit rates it obtains on the basis of
relatively little training. We found that OTB attained overall hit rates of 97% after training on only 1009
sentences for EJV. OTB crossed the 97% threshold in EME after only 621 training sentences. Incremental
OTB training requires human interaction with a point-and-click interface. Our EJV training was completed
after 16 hours with the interface; our EME training required 10 hours.

AutoSlog is a dictionary construction tool that analyzes source texts in conjunction with associated key
templates (or text annotations) in order to propose concept node (CN) definitions for CIRCUS [Riloff &
Lehnert 1993; Riloff 1993]. A special interface is then used for a manual review of the AutoSlog
definitions in order to separate the good ones from the bad ones. Of 3167 AutoSlog CN definitions
proposed in response to 1100 EJV key templates, 944 (30%) were retained after manual inspection. For
EME, AutoSlog proposed 2952 CN definitions in response to 1000 key templates and 2275 (77%) of these
were retained after manual inspection. After generalizing the original definitions with active/passive
transformations, verb tense generalizations, and singular/plural generalizations, our final EJV dictionary
contained 3017 CN definitions and our final EME dictionary contained 4220 CN definitions. It took 20
hours to manually inspect and filter the full EJV dictionary; the full EME dictionary was completed in 17
hours. The CIRCUS dictionary used in our official run was based exclusively on AutoSlog CN definitions.
No hand-coded or manually altered definitions were added to the CN dictionary.
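
The retention percentages quoted for AutoSlog follow directly from the reported counts. The short check below uses only the numbers stated in this section; the variable and function names are ours, chosen for illustration.

```python
# Proposed and retained AutoSlog concept node (CN) definitions,
# as reported in the text for each domain.
reported = {
    "EJV": {"proposed": 3167, "retained": 944},
    "EME": {"proposed": 2952, "retained": 2275},
}


def retention_rate(proposed, retained):
    """Fraction of proposed CN definitions kept after manual review."""
    return retained / proposed


rates = {domain: retention_rate(**counts) for domain, counts in reported.items()}

for domain, rate in rates.items():
    # Rounded to whole percent, as in the text.
    print(f"{domain}: {rate:.0%} of proposed definitions retained")
```

The much higher retention rate in EME is one of the domain differences the reported counts make visible.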

When CIRCUS processes a sentence it can invoke a semantic feature tagger (MayTag) that dynamically
assigns features to nouns and noun modifiers. MayTag uses a feature taxonomy based on the semantics of
our target templates, and it dynamically assigns context-sensitive tags using a corpus-driven case-based
reasoning algorithm [Cardie 93]. MayTag operates as an optional enhancement to CIRCUS sentence
analysis. We ran CIRCUS with MayTag for EJV, but did not use it for EME (we'll return to a discussion
of this and other domain differences later). MayTag was trained on 174 EJV sentences containing 5591
words (3060 open class words and 2531 closed class words). Our tests indicate that MayTag achieves a 74%
hit rate on general semantic features (covering 14 possible tags) and a 75% hit rate on specific semantic
features (covering 42 additional tags). Interactive training for MayTag took 14 hours using a text editor.

An important aspect of the MUC-5 task concerns information extraction at the level of noun phrases.
Important set fill information is often found in modifiers, such as adjectives and prepositional phrases.
Part-of-speech tags help us identify basic noun phrase components, but higher-level processes are needed to
determine if a prepositional phrase should be attached, how a conjunction should be scoped, or if a comma
should be crossed. Noun phrase recognition is a non-trivial problem at this higher level. To address the
more complicated aspects of noun phrase recognition, we use a trainable classifier that attempts to find the
best termination point for a relevant noun phrase. This component was trained exclusively on the EJV
corpus and then used without alteration for both EJV and EME. Experiments indicate that the noun phrase
classifier terminates EJV noun phrases perfectly 87% of the time. 7% of its noun phrases pick up spurious
text (they are extended too far), and 6% are truncated (they are not extended, or not extended far enough).
Similar hit rates are found with EME test data: 86% for exact recognition, with 6% picking up spurious text
and 8% being truncated. The noun phrase classifier was trained on 1350 EJV noun phrases examined in
context. It took 14 hours to manually mark these 1350 instances using a text editor.
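
The three outcome categories used to evaluate noun phrase termination (exact, extended too far, truncated) can be made concrete with a small scoring sketch. This is not the evaluation code used for the reported figures; it assumes each noun phrase is represented by the token index where the classifier ended it and the index where it should have ended, and all names are hypothetical.

```python
from collections import Counter


def termination_outcome(predicted_end, gold_end):
    """Classify one noun phrase by comparing its predicted and correct end positions."""
    if predicted_end == gold_end:
        return "exact"
    # Ending later than the correct point picks up spurious text;
    # ending earlier truncates the phrase.
    return "overextended" if predicted_end > gold_end else "truncated"


def outcome_rates(pairs):
    """Return the fraction of phrases in each outcome category.

    `pairs` is an iterable of (predicted_end, gold_end) token indices.
    """
    counts = Counter(termination_outcome(p, g) for p, g in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no noun phrases to score")
    return {name: counts[name] / total
            for name in ("exact", "overextended", "truncated")}


# Toy data, invented for illustration only.
example = [(5, 5), (7, 5), (3, 5), (4, 4)]
print(outcome_rates(example))
```

By construction the three rates sum to one, which matches the reported breakdowns (87% + 7% + 6% for EJV and 86% + 6% + 8% for EME).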

Before we can go from CIRCUS output to template instantiations, we create intermediate structures
called memory tokens. Memory tokens incorporate coreference decisions and structure relevant information
to facilitate template generation. Memory tokens record source strings from the original input text, OTB
tags, MayTag features, and pointers to concept nodes that extracted individual noun phrases.
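
To make the contents of a memory token concrete, here is a minimal sketch of a record holding the four kinds of information listed above. The class, its field names, the `merge` method, and the concept node identifiers are all hypothetical; the paper does not give the actual data structure, which was implemented in Common Lisp.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MemoryToken:
    """Hypothetical container for what one memory token records."""
    source_strings: List[str]               # strings from the original input text
    pos_tags: List[str]                     # part-of-speech tags from the tagger
    maytag_features: List[Tuple[str, str]]  # (general, specific) semantic features
    concept_nodes: List[str] = field(default_factory=list)  # CNs that extracted the NP

    def merge(self, other: "MemoryToken") -> "MemoryToken":
        """Combine two tokens that a coreference decision judged to be the same object."""
        return MemoryToken(
            self.source_strings + other.source_strings,
            self.pos_tags + other.pos_tags,
            self.maytag_features + other.maytag_features,
            self.concept_nodes + other.concept_nodes,
        )


# Placeholder values: the CN identifiers "cn-a" and "cn-b" are invented.
full = MemoryToken(["BRIDGESTONE SPORTS CO="], ["nm", "nm", "noun"],
                   [("WS-JV-ENTITY", "WS-COMPANY-NAME")], ["cn-a"])
short = MemoryToken(["BRIDGESTONE SPORTS"], ["nm", "noun"],
                    [("WS-JV-ENTITY", "WS-COMPANY-NAME")], ["cn-b"])
merged = full.merge(short)
print(merged.source_strings)
```

Keeping every source string after a merge is one way to retain the alternative names of an object for later template generation.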

Discourse analysis contributes to critical decisions associated with memory tokens. Here we find the
greatest challenges to trainable language systems. Thus far, we have implemented one trainable component
that contributes to coreference resolution in limited contexts. We isolate compound noun phrases that are
syntactically consistent with appositive constructions and pass these NP pairs on to a coreference classifier.
Since adjacent NPs may be separated by a comma if they occur in a list or at a clause boundary, it is easy to
confuse legitimate appositives with pairings of unrelated (but adjacent) NPs. Appositive recognition is
therefore treated as a binary classification problem that can be handled with corpus-driven training. For our
official MUC-5 runs we trained a classifier to handle appositive recognition using EJV development texts
and then used the resulting classifier for both EJV and EME. Our best test results with this classifier
showed an 87% hit rate on EJV appositives. It took 10 hours to manually classify 2276 training instances
for the appositive classifier using a training interface.

Our final tool, TTG, is responsible for the creation of template generators that map CIRCUS output
into final template instantiations. TTG template generators are responsible for the recognition and creation
of domain objects as well as the insertion of relational links between domain objects. TTG is corpus-driven
and requires no human intervention during training. Application-specific access methods (pathing functions)
must be hand-coded for a new domain but these can be added to TTG in a few days by a knowledgeable
technician working with adequate domain documentation. Once these adjustments are in place, TTG uses
memory tokens and key templates to train classifiers for template generation. No further human
intervention is required to create the template generators, although additional testing, tuning and adjustments
are needed for optimal performance.

Our hybrid architecture demonstrates how machine learning capabilities can be utilized to acquire many
different kinds of knowledge from a corpus. These same acquisition techniques also make it easy to exploit
the resulting knowledge without additional knowledge engineering or sophisticated reasoning. The
knowledge we can readily acquire from a corpus of representative texts is limited with respect to reusability,
but nevertheless cost-effective in a system development scenario predicated on customized software. The
trainable components used for both EJV and EME were completed after 101 hours of interactive work by a
human-in-the-loop. Moreover, most of our training interfaces can be effectively operated by domain experts:
programming knowledge or familiarity with computational linguistics is generally not required.² Near the
end of this paper we will report the results of a system development experiment that supports this claim.
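
The 101-hour total is consistent with the per-component training times reported earlier in this section. The breakdown below is our reading of those figures, not a table from the original paper, and the labels are ours.

```python
# Interactive training time in hours, as stated for each component in this section.
training_hours = {
    "OTB tagger (EJV)": 16,
    "OTB tagger (EME)": 10,
    "AutoSlog dictionary review (EJV)": 20,
    "AutoSlog dictionary review (EME)": 17,
    "MayTag semantic tagger (EJV only)": 14,
    "Noun phrase classifier (trained on EJV)": 14,
    "Appositive classifier (trained on EJV)": 10,
}

total_hours = sum(training_hours.values())
print(total_hours)  # 101
```

TTG does not appear in the breakdown because its training is described as requiring no human intervention.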

There will always be a need for some amount of manual programming during the system development
cycle for a new information extraction application. Even so, significant amounts of system development
that used to rely on experienced programmers have been shifted over to trainable language components. The
ability to automate knowledge acquisition on the basis of key templates represents a significant
redistribution of labor away from skilled knowledge engineers, who need access to domain knowledge,
directly to the domain experts themselves. By putting domain experts into the role of the human-in-the-loop
we can reduce dependence on software technicians. When significant amounts of system development work
are being handled by automated knowledge acquisition and expert-assisted knowledge acquisition, it will
become increasingly cost-effective to customize and maintain a variety of information extraction
applications. We have only just begun to explore the range of possibilities associated with trainable
language processing systems.

The hybrid architecture underlying our official MUC-5 systems was less than six months old at the
time of the evaluation, and most of the trainable language components that we utilized were less than a year
old. Fewer than 24 person-months were expended for both of the EJV and EME systems, although this
estimate is confounded by the fact that trainable components and their associated interfaces were being
designed, implemented, and tested by the same people responsible for our MUC-5 system development. The
creation of a trainable system component represents a one-time system development investment that can be
applied to subsequent systems at much less overhead.

Figure 1 outlines the basic flow-of-control through the major components of the UMass/Hughes MUC-5
system. Note that most of the trainable components depend only on the texts from the development corpus.
The concept node dictionary and the trainable template generator also rely on answer keys during training. In
the case of the concept node dictionary, we have been able to drive our dictionary construction process on
the basis of annotated texts created by using a point-and-click text marking interface. So the substantial
overhead associated with creating a large collection of key templates is not needed to support automated
dictionary construction. However, we do not see how to support trainable template generation without a set
of key templates, so this one trainable component requires a significant investment with respect to labor.


² Some technical background is needed to train OTB. Knowledge of our part-of-speech tags is needed for that
interface.
Figure 1: System Architecture
THE OFFICIAL TEST RUNS


Our official test runs were conducted on four DECstations running Allegro Common Lisp. The runs
went smoothly in EME but we encountered one fatal error in one of the EJV test sets. Portions of our
official EJV and EME score reports are shown in Figures 2 and 3.
Figure 2: Official EJV Score Report Summaries

Figure 3: Official EME Score Report Summaries


THE  OPTIONAL  TEST  RUNS 

We ran optional tests to see what sort of recall/precision trade-offs were available from the system.
Since the template generator is a set of classifiers, and each classifier outputs a certainty associated with a
hypothesized template fragment, we have many parameters that can be manipulated. Raising the threshold
on the certainty for a hypothesis will, in most cases, increase precision and reduce recall. In the
experiments reported here, we have varied the parameters over broad classes of discrimination trees. There
are three important classes of decision tree: (1) trees that filter the creation of objects based on string fills,
(2) trees that filter the creation of objects based on set fills, and (3) trees that hypothesize relations among
objects. An example of the first class is the tree that filters the CIRCUS output for entity names in the
EJV domain. An example of the second class is the tree that filters possible lithography objects based on
evidence of the type of lithography process. The trees that hypothesize TIE_UP_RELATIONSHIP's and
ME_CAPABILITY's are examples of the third class.
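
The effect of a certainty threshold on recall and precision can be illustrated with a short sketch. It is not the TTG implementation: it assumes each hypothesized template fragment carries a certainty in [0, 1] and a gold-standard correctness judgement, and the class, function names, and toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    fragment: str      # hypothesized template fragment
    certainty: float   # certainty reported by the classifier
    correct: bool      # whether the fragment matches the answer key (assumed known)


def precision_recall(hypotheses, threshold, total_relevant):
    """Keep hypotheses at or above the threshold and score what survives."""
    kept = [h for h in hypotheses if h.certainty >= threshold]
    hits = sum(h.correct for h in kept)
    # With nothing kept, precision is undefined; we report 0.0 by convention.
    precision = hits / len(kept) if kept else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


def sweep(hypotheses, thresholds, total_relevant):
    """Return (threshold, precision, recall) for each threshold setting."""
    return [(t, *precision_recall(hypotheses, t, total_relevant)) for t in thresholds]


# Toy data, invented for illustration; the answer key holds 4 relevant fragments.
toy = [
    Hypothesis("a", 0.95, True),
    Hypothesis("b", 0.80, True),
    Hypothesis("c", 0.60, False),
    Hypothesis("d", 0.40, True),
    Hypothesis("e", 0.20, False),
]
for t, p, r in sweep(toy, [0.0, 0.5, 0.9], total_relevant=4):
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold moves precision up and recall down, which is the behaviour the text describes as holding in most cases.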

For  these  experiments  we  have  varied  the  certainty  thresholds  for  all  trees  of  a  given  class.  Figure  4 
shows  the  trade-off  achieved  for  EME. 
Figure 4: Trade-off curve for EME

[Plot omitted: horizontal axis Recall, 0 to 40; vertical axis 0 to 80.]

Figure 5: Trade-off curve for EJV


The EME trade-off curve in Figure 4 was achieved by varying, in concert, the thresholds on all three
classes of discrimination tree from 0.0 to 0.9. Figure 5 shows the trade-off curve achieved in EJV. The
differences between the two curves highlight differences between the two domains and between the system
configurations used for the two domains. The EME curve shows a much more dramatic trade-off. The EJV
curve shows that only modest variation in recall and precision is achievable. Part of this is a reflection of
the two domains. In EJV, most relationships were found via two noun phrases that shared a common CN
trigger. This method proved to be effective at detecting relationships. Therefore the only real difference in
the trade-off comes from varying the thresholds for the string-fill and set-fill trees, which generate the
objects that are then composed into relationships. In EME, there are not nearly as many shared triggers and so
the template generator must attempt intelligent guesses for relations. The probabilistic guesses made in
EME are much more amenable to threshold manipulation than the more structured information used in EJV.
Also, in EJV the system ran with a slot masseur that embodied some domain knowledge. In EJV, TTG
was configured to hypothesize objects only if the slot masseur had found a reasonable slot-fill or set-fill.
This use of domain knowledge further limited the efficacy of changing certainty thresholds.
TRAINABLE INFORMATION EXTRACTION IN ACTION

Before CIRCUS can tackle an input sentence, we have to pass the source text through a preprocessor
that locates sentence boundaries and reworks the source text into a list structure. The preprocessor replaces
punctuation marks with special symbols and applies text processing specialists to pick up dates, locations,
and other objects of interest to the target domain. We use the same preprocessing specialists for both EJV
and EME: many specialists will apply to multiple domains. A subset of the Gazetteer was used to support
the location specialist, but no other MRDs are used by the preprocessing specialists. We do not have a
specialist that attempts to recognize company names.

Source Text:

BRIDGESTONE SPORTS CO. SAID FRIDAY IT HAS SET UP A JOINT VENTURE IN TAIWAN WITH A
LOCAL CONCERN AND A JAPANESE TRADING HOUSE TO PRODUCE GOLF CLUBS TO BE SHIPPED TO
JAPAN.

Preprocessed Text:

(*START* BRIDGESTONE SPORTS CO= SAID **ON_241189 IT HAS SET UP A JOINT VENTURE IN @@_Taiwan WITH A
LOCAL CONCERN AND A JAPANESE TRADING HOUSE TO PRODUCE GOLF CLUBS TO BE SHIPPED TO @@_Japan
*END*)
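
The date token in the preprocessed text replaces the weekday name "FRIDAY" with a resolved calendar date. The sketch below shows one way such a resolution could work. It is not the authors' date specialist: the rule "take the most recent matching weekday on or before the dateline", the function names, and the dateline values used in the example are all assumptions made for illustration. Only the token shape (`**ON_` followed by day, month, and two-digit year) is taken from the example above.

```python
from datetime import date, timedelta

WEEKDAYS = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
            "FRIDAY", "SATURDAY", "SUNDAY"]


def resolve_weekday(word, dateline):
    """Return the most recent date on or before `dateline` that falls on `word`.

    This resolution rule is an assumption; news text could also refer forward.
    """
    target = WEEKDAYS.index(word.upper())          # Monday is 0, as in date.weekday()
    days_back = (dateline.weekday() - target) % 7  # 0 if the dateline is that weekday
    return dateline - timedelta(days=days_back)


def date_token(resolved):
    """Format a resolved date in the DDMMYY token style seen in the preprocessed text."""
    return "**ON_" + resolved.strftime("%d%m%y")


# Hypothetical dateline: Monday, 27 November 1989.
print(date_token(resolve_weekday("Friday", date(1989, 11, 27))))
```

With a dateline on or shortly after 24 November 1989, this rule yields the same token as the preprocessed example.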

Note that the date specialist had to consult the dateline of the source text in order to determine that
"Friday" must refer to November 24, 1989. Once the preprocessor has completed its analysis, the OTB
part-of-speech tagger identifies parts of speech:

(*START*  BRIDGESTONE  SPORTS  CO=   SAID  **ON_241189  IT    HAS  SET   UP    A    JOINT
 strt     nm           nm      noun  verb  $date$       noun  aux  pasp  ptcl  art  nm

VENTURE  IN    @@_Taiwan   WITH  A    LOCAL  CONCERN  AND   A    JAPANESE  TRADING
noun     prep  $location$  prep  art  nm     noun     conj  art  nm        nm

HOUSE  TO   PRODUCE  GOLF  CLUBS  TO   BE   SHIPPED  TO    @@_Japan    *END*)
noun   inf  verb     nm    noun   inf  aux  pasp     prep  $location$  stop

OTB tagged 97.1% of the words in EJV 0592 correctly. One error associated with "... A COMPANY
ACTIVE IN TRADING WITH TAIWAN ..." led to a truncated noun phrase when "active" was tagged as a head
noun instead of a nominative predicate.

With part-of-speech tags in place, CIRCUS can begin selective concept extraction. On this first
sentence from EJV 0592, CIRCUS triggers 18 CN definitions triggered by the words "said" (3 CNs), "set"
(3 CNs), "venture" (9 CNs), "produce" (1 CN), and "shipped" (2 CNs). These CNs extract a number of key
noun phrases, and assign semantic features to these noun phrases based on soft constraints in the CN
definition. Some of these features were recognized to be inconsistent with the slot fill and others were
deemed acceptable. Notice that different CNs picked up "Bridgestone Sports Co." with incompatible
semantic features (it was associated with both a joint venture and a joint venture parent feature).


extracted NPs: BRIDGESTONE SPORTS CO=; IT; A JOINT VENTURE; GOLF CLUBS

rejected CN features: jv-person; jv-person

accepted CN features: jv-entity, jv-parent, jv; jv-entity, company; jv-prod-serv, production

As we can see from this sentence, CN feature types are not always reliable, and CIRCUS does not
always recognize the violation of a soft feature constraint. An independent set of semantic features is
obtained from MayTag. In the first sentence of EJV 0592, MayTag only missed marking "golf clubs" as a
product/service:
words          MayTag semantic features                   hits & misses

*START*
BRIDGESTONE    ((WS-JV-ENTITY) (WS-COMPANY-NAME))         ; correct
SPORTS         ((WS-JV-ENTITY) (WS-COMPANY-NAME))         ; correct
CO=            ((WS-JV-ENTITY) (WS-GENERIC-COMPANY))      ; correct
SAID
IT             ((WS-ENTITY) NIL)                          ; correct
HAS
SET
UP
A
JOINT          ((WS-JV-ENTITY) NIL)                       ; correct
VENTURE        ((WS-JV-ENTITY) NIL)                       ; correct
IN
JV-LOCATION    ((WS-LOCATION) NIL)                        ; correct
WITH
A
LOCAL          ((WS-ENTITY) NIL)                          ; correct
CONCERN        ((WS-JV-ENTITY) NIL)                       ; correct
AND
A
JAPANESE       ((WS-NATIONALITY) NIL)                     ; correct
TRADING        ((WS-PRODUCT-SERVICE) (WS-SALES))          ; correct
HOUSE          ((WS-JV-ENTITY) NIL)                       ; correct
TO
PRODUCE
GOLF           ((WS-ENTITY) NIL)                          ; incorrect
CLUBS          ((WS-ENTITY) NIL)                          ; incorrect
TO
BE
SHIPPED
TO
JV-LOCATION    ((WS-LOCATION) NIL)                        ; correct

In addition to extracting some noun phrases and assigning semantic features to those noun phrases, we
also call the noun phrase classifier to see if any of the simple NPs picked up by the CN definitions should
be extended to longer NPs. For this sentence, the noun phrase classifier extended only one NP: it decided
that "A JOINT VENTURE" should be extended to pick up "A JOINT VENTURE IN TAIWAN WITH A LOCAL
CONCERN AND A JAPANESE TRADING HOUSE". The second prepositional phrase should not have been
included - this is an NP expansion that was overextended.

Each noun phrase extracted by a CIRCUS concept node will eventually be preserved in a memory token
that records the CN features, MayTag features, any NP extensions, and other information associated with
CN definitions. But before we look at the memory tokens, let's briefly review the other NPs that are
extracted from the remainder of the text. For each preprocessed sentence produced in response to EJV 0592,
we will put the noun phrases extracted by CIRCUS into boldface and use underlines to indicate how the
noun phrase classifier extends some of these NPs.

(*START* BRIDGESTONE SPORTS CO= SAID **ON_241189 IT HAS SET UP A JOINT VENTURE IN @@_Taiwan WITH A
LOCAL CONCERN AND A JAPANESE TRADING HOUSE TO PRODUCE GOLF CLUBS TO BE SHIPPED TO @@_Japan
*END*)
(*START* THE JOINT VENTURE $COMMA$ BRIDGESTONE SPORTS TAIWAN CO= $COMMA$ CAPITALIZED
AT $$20000000_TWD $COMMA$ WILL START PRODUCTION **DURING_0190 WITH PRODUCTION OF &&20000 IRON
AND METAL WOOD CLUBS A MONTH *END*)
(*START* THE MONTHLY OUTPUT WILL BE LATER RAISED TO
&&50000 UNITS $COMMA$ BRIDGESTON SPORTS OFFICIALS SAID *END*)
(*START* THE NEW COMPANY $COMMA$ BASED IN KAOHSIUNG $COMMA$ SOUTHERN TAIWAN $COMMA$ IS OWNED %%75 BY
BRIDGESTONE SPORTS $COMMA$ %%15 BY UNION PRECISION CASTING CO= OF @@_Taiwan AND THE REMAINDER BY TAGA
CO= $COMMA$ A COMPANY ACTIVE IN TRADING WITH TAIWAN $COMMA$ THE OFFICIALS SAID *END*)
(*START* BRIDGESTONE SPORTS HAS SO_FAR BEEN ENTRUSTING PRODUCTION OF GOLF CLUB PARTS WITH UNION
PRECISION CASTING AND OTHER TAIWAN COMPANIES *END*)
(*START* WITH THE ESTABLISHMENT OF THE TAIWAN UNIT $COMMA$ THE JAPANESE SPORTS GOODS MAKER PLANS TO
INCREASE PRODUCTION OF LUXURY CLUBS IN @@_Japan *END*)




As far as our CN dictionary coverage is concerned, we were able to identify all of the relevant noun
phrases needed with the exception of "A LOCAL CONCERN AND A JAPANESE TRADING HOUSE", which should
have been picked up by a JV parent CN. In fact, our AutoSlog dictionary had two such definitions in place
for exactly this type of construction, but neither definition was able to complete its instantiation because of
a previously unknown problem with time stamps inside CIRCUS. This was a processing failure - not a
dictionary failure.

Trainable noun phrase analysis processes 13 of the 17 NP instances marked above correctly. Three of
the NPs were expanded too far, and one was expanded but not quite far enough due to a tagging error by
OTB ("a company active ..."). An inspection of the 13 correct instances reveals that 7 of these would have
been correctly terminated by simple heuristics based on part-of-speech tags. It is important to note that the
trainable NP analyzer had to deduce these more "obvious" heuristics in the same way that it deduces
rules for more complicated decisions. It is encouraging to see that straightforward heuristics can be
acquired automatically by trainable classifiers. When our analyzer makes a mistake, it generally happens
with the more complicated noun phrases (which is where hand-coded heuristics tend to break down as well).

After the noun phrase classifier has attempted to find the best termination points for the relevant NPs,
we then call the coreference classifier to consider pairs of adjacent NPs separated by a comma. In this text
we find three such appositive candidates (the second of which contains an extended NP that was not properly
terminated):

THE JOINT VENTURE, BRIDGESTONE SPORTS TAIWAN CO.

TAGA CO., A COMPANY ACTIVE

THE NEW COMPANY, BASED IN KAOHSIUNG, SOUTHERN TAIWAN

In the third case, the location specialist failed to recognize either Kaohsiung or Southern Taiwan as
names of locations. On the other hand, the fragment "based in Kaohsiung" was recognized as a location
description, and the text was therefore reformatted as "THE NEW COMPANY (%BASED-IN% KAOHSIUNG), SOUTHERN
TAIWAN", which set up the entire construct as an appositive candidate. The coreference classifier then went
on to accept each of these three instances as valid appositive constructions. This was the right decision in
the first two cases, but wrong in the third. If full location recognition had been working, this last instance
would never have been handed to the coreference classifier in the first place.
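
The candidate selection step described above (adjacent NPs separated by a comma) can be sketched as follows. This is an illustration, not the system's code: it assumes a sentence is a list of preprocessed tokens in which commas appear as `$COMMA$`, and that noun phrases are given as sorted half-open `(start, end)` index spans. The function name and span representation are hypothetical. Deciding whether a candidate pair is a true appositive is left to the trained classifier and is not shown.

```python
COMMA = "$COMMA$"


def appositive_candidates(tokens, np_spans):
    """Return pairs of adjacent noun phrases separated by exactly one comma token.

    `np_spans` holds (start, end) index pairs, end exclusive, in sentence order.
    """
    candidates = []
    for (s1, e1), (s2, e2) in zip(np_spans, np_spans[1:]):
        # The second NP must begin immediately after a single comma
        # that follows the first NP.
        if s2 - e1 == 1 and tokens[e1] == COMMA:
            candidates.append((" ".join(tokens[s1:e1]), " ".join(tokens[s2:e2])))
    return candidates


sentence = ("THE JOINT VENTURE $COMMA$ BRIDGESTONE SPORTS TAIWAN CO= "
            "$COMMA$ CAPITALIZED AT").split()
spans = [(0, 3), (4, 8)]  # hand-marked NP spans for this example
print(appositive_candidates(sentence, spans))
```

As the third candidate in the text shows, this purely positional test also admits pairs that are not appositives, which is why a classifier is applied afterwards.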

The coreference classifier tells us when adjacent noun phrases should be merged into a single memory
token. We also invoke some hand-coded heuristics for coreference decisions that can be handled on the basis
of lexical features alone. These heuristics determine that Bridgestone Sports Co. is coreferent with
Bridgestone Sports, and that "THE JOINT VENTURE, BRIDGESTONE SPORTS TAIWAN CO." is coreferent with "A
JOINT VENTURE IN TAIWAN ..." Our lexical coreference heuristics are nevertheless very conservative, so they
fail to merge our four product service instances in spite of the fact that "clubs" appears in three of these
string fills. In effect, we pass the following memory token output to TTG:

5 recognized companies (#4 and #5 should have been merged):

(1) "TAGA CO=" aka "A COMPANY ACTIVE"

(2) "UNION PRECISION CASTING CO= OF @@_TAIWAN"

(3) "BRIDGESTONE SPORTS CO=" aka "BRIDGESTONE SPORTS"

(4) "THE NEW COMPANY (%BASED-IN% KAOHSIUNG)" aka "SOUTHERN TAIWAN"

(5) "THE JOINT VENTURE" aka "BRIDGESTONE SPORTS TAIWAN CO=" aka
    "A JOINT VENTURE IN @@_TAIWAN WITH A LOCAL CONCERN AND A JAPANESE TRADING HOUSE"

4 product service strings (all of these should have been merged):

(6) "GOLF CLUBS"

(7) "&&20000 IRON AND METAL WOOD CLUBS"

(8) "GOLF CLUB PARTS WITH UNION PRECISION CASTING AND OTHER TAIWAN COMPANIES"

(9) "LUXURY CLUBS IN @@_JAPAN"

1 ownership and 2 percent objects:

We failed to extract "the remainder by ..." for the third ownership object
because our percentage specialist was not watching for verbal referents
in a percentage context - this could be fixed with an adjustment to the
specialist.

(10) "$$20000000_TWD"

(11) "%%15"

(12) "%%75"
When TTG receives memory tokens as input, the object existence classifiers try to filter out spurious
information picked up by overzealous CN definitions. Unfortunately, in the case of 0592, TTG filtered out
two good memory tokens: #1 (describing the parent Taga Co.) and #5 (describing the joint venture). It was
particularly damaging to throw away #5 because that memory token contained the correct company name
(Bridgestone Sports Taiwan Co.). Of the 3 remaining memory tokens describing companies, TTG correctly
identified the two parent companies on the basis of semantic features, but then it was forced to pick up #4
as the child company. Our pathing function was smart enough to know that "the new company" was probably
not a good company name, but that left us with "southern Taiwan" for the company name. So a failure that
started with location recognition led to a mistake in trainable appositive recognition, which then combined
with a failure in lexical coreference recognition and a filtering error by TTG to give us a joint
venture named "SOUTHERN TAIWAN" instead of "BRIDGESTONE SPORTS TAIWAN". Overly aggressive filtering
by TTG resulted in the loss of our 4 product service memory tokens.

Our CN instantiations do not explicitly represent relational information, but CNs that share a common
trigger word can be counted on to link two CN instantiations in some kind of a relationship. Trigger
families can reliably tell us when two entities are related, but they can't tell us what that relationship is. We
relied on TTG to deduce specific relationships on the basis of its training. In cases like "75% BY
BRIDGESTONE SPORTS", TTG had no trouble linking extracted percentage objects with companies. But our
trainable link recognition ran into more difficulties when trigger families contained multiple companies.
Among the features that TTG had available for discrimination were closed class features, such as memory
token types, semantic features, and CN patterns, and open class features (i.e. trigger words). However,
although there exist heuristics for discriminating relationships based on particular words, the combination
of the algorithm used (ID3) and the amount of data (600 stories) failed to induce these heuristics. There may
be other algorithms, however, that could use the same or less data and external knowledge to derive such
heuristics from the training data.

The processing for EME proceeds very similarly to EJV, with the exception that MayTag is not used
in our EME configuration, and in the EME system we used our standard CN mechanism and an additional
keyword CN (KCN) mechanism. The KCN mechanism was used to recognize specific types of processing,
equipment, and devices that have one or only a few possible manifestations. Below we see the OTB tags
for the first sentence, all of which are correct. In fact, for EME text 2789568, OTB had a 100% hit rate.

*start*  **DURING_2Q91  $COMMA$  Nikon  Corp=  $LPAREN$  &&7731
strt     $date$         punc     nm     noun   punc      noun

$RPAREN$  plans  to   market  the  NSR-1755EX8A  $COMMA$  a    new  stepper
punc      verb   inf  verb    art  noun          punc     art  nm   noun

intended  for   use   in    the  production  of    &&64-  Mbit  DRAMs  *end*
pasp      prep  noun  prep  art  noun        prep  nm     nm    noun   stop

The memory token structure below illustrates the processing of the text prior to TTG. Two NPs are
identified as the same entity, "Nikon" and "Nikon Corp." The two NPs are merged into one memory token
based on name merging heuristics. The second NP demonstrates how multiple recognition mechanisms can
add robustness to the processing. "Nikon Corp." is picked up both by a CN triggered off of "plans to
market" and by two KCNs, one that looks for "Corp." and another that looks for the lead NP in the story.

(TOKEN
  (TYPE (ME-ENTITY))
  (SUBTYPE NIL)
  (RELATION NIL)
  (SLOT-FILLS
    (TYPE COMPANY)
    (NAME (:SYM-LIST NIKON CORP=)))
  (NPS
    (NP 2 1 (NIKON)
      (CNS %ME-ENTITY-NAME-SUBJECT-VERB-AND-DO-STEPPER%))
    (NP 0 3 (NIKON CORP=)
      (CNS %ME-ENTITY-NAME-SUBJECT-VERB-AND-INFINITIVE-PLANS-TO-MARKET%)
      (KCNS %KEYWORD-ME-ENTITY-CORP=%
            %LEAD-NP%))))
Unfortunately, our system did not get any lithography objects for this story. On our list of things to
get to if time permitted was creating a lithography object for an otherwise orphaned stepper. We would have
gotten only one lithography object, since we merged all mentions of "stepper" into one memory token.

We created a synthetic version of the system that inserted a lithography memory token corresponding to
each stepper. One was discarded by TTG and another was created because there were two different equipment
objects attached to the remaining lithography object. The features that TTG used to hypothesize a new
ME_CAPABILITY are illustrative of one of the weaknesses of this particular method. TTG used the
following features to decide not to generate an ME_CAPABILITY developer.


FEATURE 

RELATION  CERTAINTY  AFTER  FEATURE 

0.36 

0.23 

The  entity  is  not  triggered  off  "developed" 

0.14 

The  process  is  not  CVD 

0.03 

The process is not LITHOGRAPHY of UKN type

0.04 

The  process  is  not  ETCHING 

0.06 

The entity is not triggered off "from"

0.12 

All of the features are negative, and the absence of each feature reduces the certainty that the relation
holds, because each feature's presence, broadly speaking, is positive evidence of a relation. Therefore, the
node of the decision tree that is found is a grouping of cases that have no particular positive evidence to
support the relation, but also no negative evidence. With the relation threshold set at 0.3, this yields a
negative identification of a relation. However, there are strong indications of a relation here. For example,
the trigger "plans to market" is good evidence of a relation; however, the nature of decision tree algorithms
(recursively splitting the training data) causes us to lose that feature (in favor of other, better features).
The following set of features shows what TTG used to generate an ME_CAPABILITY distributor.


FEATURE 

RELATION  CERTAINTY  AFTER  FEATURE 

0.38 

0.47 

The  entity  is  not  in  a  PP 

0.58 

A  CN  marked  the  entity  as  an  entity 

0.55 

■  f  r.4  li  1  w;  1  n  vw » 1 1  inci 

0.40 

Again, we do not see here the features that we would expect, given the text. A human generating rules
would say that "plans to market" is a good indication of an ME_CAPABILITY distributor.
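
The thresholding step can be illustrated with the certainty values listed in the two feature tables above. This is only a sketch under two assumptions: that the last certainty listed in each table is the one at the decision-tree node finally reached, and that a relation is generated when that certainty meets the 0.3 threshold. The function name `relation_holds` is hypothetical.

```python
RELATION_THRESHOLD = 0.3  # relation threshold quoted in the text

def relation_holds(certainties, threshold=RELATION_THRESHOLD):
    """Return True if the final certainty along the tree path meets the threshold."""
    return certainties[-1] >= threshold

# Certainty values in the order they appear in the two tables.
developer_path = [0.36, 0.23, 0.14, 0.03, 0.04, 0.06, 0.12]
distributor_path = [0.38, 0.47, 0.58, 0.55, 0.40]

print(relation_holds(developer_path))    # False: no developer relation generated
print(relation_holds(distributor_path))  # True: distributor relation generated
```

Neither decision depends on the trigger "plans to market", which is the weakness discussed above.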


DICTIONARY  CONSTRUCTION  BY  DOMAIN  EXPERTS 

Sites  participating  in  the  recent  message  understanding  conferences  have  increasingly  focused  their 
research  on  developing  methods  for  automated  knowledge  acquisition  and  tools  for  human-assisted 
knowledge engineering. However, it is important to remember that the ultimate users of these tools will be
domain  experts,  not  natural  language  processing  researchers.  Domain  experts  have  extensive  knowledge 
about  the  task  and  the  domain,  but  will  have  little  or  no  background  in  linguistics  or  text  processing.  Tools 
that  assume  familiarity  with  computational  linguistics  will  be  of  limited  use  in  practical  development 
scenarios. 

To investigate practical dictionary construction, we conducted an experiment with government analysts.
We wanted to demonstrate that domain experts with no background in text processing could successfully use
the AutoSlog dictionary construction tool [Riloff and Lehnert 1993]. We compared the dictionaries
constructed by the government analysts with a dictionary constructed by a UMass researcher. The results of
the experiment suggest that domain experts can successfully use AutoSlog with only minimal training and
achieve performance levels comparable to NLP researchers.
AutoSlog is a system that automatically constructs a dictionary for information extraction tasks. Given
a training corpus, AutoSlog proposes domain-specific concept node definitions that CIRCUS [Lehnert
1991] uses to extract information from text. However, many of the definitions proposed by AutoSlog
should not be retained in the permanent dictionary because they are useless or too risky. We therefore rely
on a human in the loop to manually skim the definitions proposed by AutoSlog and separate the good ones
from the bad ones.

Two government analysts agreed to be the subjects of our experiment. Both analysts had generated
templates for the joint ventures domain, so they were experts with the EJV domain and the template-filling
task. Neither analyst had any background in linguistics or text processing, and neither had previous experience
with our system. Before they began using the AutoSlog interface, we gave them a 1.5 hour tutorial to
explain how AutoSlog works and how to use the interface. The tutorial included some examples to
highlight important issues and general decision-making advice. Finally, we gave each analyst a set of 1575
concept node definitions to review. These included definitions to extract 8 types of information: jv-entities,
facilities, person names, product/service descriptions, ownership percentages, total revenue amounts,
revenue rate amounts, and ownership capitalization amounts.

We did not give the analysts all of the concept node definitions proposed by AutoSlog for the EJV
domain. AutoSlog actually proposed 3167 concept node definitions, but the analysts were only available for
two days and we did not expect them to be able to review 3167 definitions in this limited time frame. So
we created an "abridged" version of the dictionary by eliminating jv-entity and product/service patterns that
appeared only infrequently in the corpus.¹ The resulting "abridged" dictionary contained 1575 concept node
definitions.

We compared the analysts' dictionaries with the dictionary generated by UMass for the final Tipster
evaluation. However, the official UMass dictionary was based on the complete set of 3167 definitions
originally proposed by AutoSlog as well as definitions that were spawned by AutoSlog's optional
generalization modules. We did not use the generalization modules in this experiment, due to time
constraints. To create a comparable UMass dictionary, we removed all of the "generalized" definitions from
the UMass dictionary as well as the definitions that were not among the 1575 given to the analysts. The
resulting UMass dictionary was a much smaller subset of the official UMass dictionary.

Analyst A took approximately 12.0 hours and Analyst B took approximately 10.6 hours to filter their
respective dictionaries. Figure 6 shows the number of definitions that each analyst kept, separated by type.
For comparison's sake, we also show the breakdown for the smaller UMass dictionary.


CN Type                # proposed     # kept     # kept        # kept
                       by AutoSlog    (UMass)    (Analyst A)   (Analyst B)

entity                     688          311         357           423
facility                    80           20          16            55
ownership-percent          174           91         117            91
person                     243          119         149            52
prod_serv                  316           76         152            44
revenue-rate                19           14          12            16
revenue-total               30           22          15            26
total-capitalization        25           14          13            22

TOTAL                     1575          667         831           729

Figure  6:  Comparative  dictionary  sizes 
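
As a quick arithmetic check on Figure 6, the sketch below recomputes the column totals and the share of the 1575 proposed definitions that each dictionary retained. The counts are copied from the figure; the names `FIGURE_6`, `column_totals`, and `retention_rates` are illustrative only.

```python
# Counts from Figure 6: (proposed by AutoSlog, kept by UMass, Analyst A, Analyst B)
FIGURE_6 = {
    "entity": (688, 311, 357, 423),
    "facility": (80, 20, 16, 55),
    "ownership-percent": (174, 91, 117, 91),
    "person": (243, 119, 149, 52),
    "prod_serv": (316, 76, 152, 44),
    "revenue-rate": (19, 14, 12, 16),
    "revenue-total": (30, 22, 15, 26),
    "total-capitalization": (25, 14, 13, 22),
}

def column_totals(table):
    """Sum each column across all CN types."""
    return tuple(sum(column) for column in zip(*table.values()))

def retention_rates(table):
    """Percentage of proposed definitions kept in each dictionary."""
    proposed, *kept = column_totals(table)
    return [round(100 * k / proposed, 1) for k in kept]

print(column_totals(FIGURE_6))    # (1575, 667, 831, 729)
print(retention_rates(FIGURE_6))  # [42.3, 52.8, 46.3]
```

The totals agree with the TOTAL row, and the per-type counts show how differently the three reviewers filtered individual categories such as person and prod_serv.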


¹While processing the training corpus, AutoSlog keeps track of the number of times that it proposes each
definition  (it  may  propose  a  definition  more  than  once  if  the  same  pattern  appears  multiple  times  in  the 
corpus).  We  removed  all  jv-entity  definitions  that  were  proposed  <  2  times  and  all  product/service 
definitions  that  were  proposed  <  3  times.  We  eliminated  jv-entity  and  product/service  definitions  only 
because  the  sheer  number  of  these  definitions  overwhelmed  the  other  types. 
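
The frequency cut described in this footnote can be sketched as follows. The thresholds come from the footnote; the data layout (one `(cn_type, pattern)` pair per proposal), the example patterns, and the function name `abridge` are assumptions made for illustration and do not reflect AutoSlog's actual data structures.

```python
from collections import Counter

# Minimum proposal counts from the footnote; all other types are kept regardless.
MIN_PROPOSALS = {"jv-entity": 2, "product/service": 3}

def abridge(proposals):
    """Keep definitions proposed often enough for their type.

    proposals: iterable of (cn_type, pattern) pairs, one per proposal.
    """
    counts = Counter(proposals)
    return sorted(
        definition
        for definition, n in counts.items()
        if n >= MIN_PROPOSALS.get(definition[0], 1)
    )

# Hypothetical proposals, not taken from the EJV corpus.
proposals = (
    [("jv-entity", "<subject> formed venture")] * 2
    + [("jv-entity", "venture with <pp>")]
    + [("product/service", "produce <dobj>")] * 2
    + [("facility", "plant in <pp>")]
)
print(abridge(proposals))
# [('facility', 'plant in <pp>'), ('jv-entity', '<subject> formed venture')]
```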
We compared the dictionaries constructed by the analysts with the UMass dictionary in the following
manner. We took the official UMass/Hughes system, removed the official UMass dictionary, and replaced it
with a new dictionary (the smaller UMass dictionary or an analyst's dictionary). One complication is that
the UMass/Hughes system includes two modules, TTG and Maytag, that use the concept node dictionary
during training. In a clean experimental design, we should ideally retrain these components for each new
dictionary. We did retrain the template generator (TTG), but we did not retrain Maytag. We expect that this
should not have a significant impact on the relative performances of the dictionaries, but we are not certain
of its exact impact. Finally, we scored each new version of the UMass/Hughes system on the Tips3 test
set. Figure 7 shows the results for each dictionary.


TIPS3           Recall   Precision    P&R     ERR

UMass/Hughes      18        51       27.06     83
Analyst A         19        47       27.39     83
Analyst B         20        47       27.89     83

Figure  7:  Comparative  scores  for  Tips3 

The F-measures (P&R) were extremely close across all 3 dictionaries. In fact, both analysts' dictionaries
achieved  slightly  higher  F-measures  than  the  UMass  dictionary.  The  error  rates  (ERR)  for  all  three 
dictionaries  were  identical.  But  we  do  see  some  variation  in  the  recall  and  precision  scores.  We  also  see 
variations  when  we  score  the  three  parts  of  Tips3  separately  (see  Figure  8). 
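
For readers unfamiliar with the P&R column, the sketch below computes a balanced F-measure from recall and precision. It assumes that P&R denotes the usual F-measure with equal weight on recall and precision (beta = 1). Because the figures report recall and precision rounded to whole percentages, values recomputed this way only approximate the listed P&R scores.

```python
def f_measure(recall, precision, beta=1.0):
    """F-measure combining recall and precision; beta = 1 weights them equally."""
    if recall == 0 and precision == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Rounded recall and precision for the UMass/Hughes row of Figure 7.
print(round(f_measure(18, 51), 2))  # 26.61 (reported P&R: 27.06, from unrounded scores)
```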


TIPS3/Part1     Recall   Precision    P&R     ERR

UMass/Hughes      18        51       27.04     83
Analyst A         20        48       28.00     82
Analyst B         22        47       29.69     81

TIPS3/Part2     Recall   Precision    P&R     ERR

UMass/Hughes      17        52       26.03     84
Analyst A         18        48       25.92     84
Analyst B         20        47       27.75     83

TIPS3/Part3     Recall   Precision    P&R     ERR

UMass/Hughes      20        50       28.12     82
Analyst A         20        46       27.96     82
Analyst B         17        48       25.25     84

Figure 8: Comparative scores for Part1, Part2, and Part3

In general, the analysts' dictionaries achieved slightly higher recall but lower precision than the UMass
dictionary. We hypothesize that this is because the UMass researcher was not very familiar with the corpus
and was therefore somewhat conservative about keeping definitions. The analysts were much more familiar
with the corpus and were probably more willing to keep definitions for patterns that they had seen before.
There is usually a trade-off involved in making these decisions: a liberal strategy will often result in higher
recall but lower precision, whereas a conservative strategy may result in lower recall but higher precision.

It is interesting to note that even though there was great variation across the individual dictionaries (see
Figure 6), the resulting scores were very similar. This may be because some definitions can contribute a
disproportionate amount of performance if they are frequently triggered by a given test set. If the three
dictionaries were in agreement on that subset of the dictionary that is most heavily used, those definitions
could dominate overall system performance. Some dictionary definitions are more important than others.

To summarize, this experiment suggests that domain experts can successfully use AutoSlog to build
domain-specific dictionaries for information extraction. With only 1.5 hours of training, two domain
experts constructed dictionaries that achieved performance comparable to a dictionary constructed by a
UMass researcher. Although this was only a small experiment, the results lend credibility to the claim that
domain experts can build effective dictionaries for information extraction.
WHAT WORKS AND WHAT NEEDS WORK

When we look at individual texts and work up a walk-through analysis of what is and is not working,
we find that many of our trainable language components are working very well. The dictionary coverage
provided by AutoSlog appears to be quite adequate. OTB is operating reliably enough for subsequent
sentence analysis. When we run into difficulties with our trainable components, we often find that many of
these difficulties stem from a mismatch of training data with test data. For example, when we trained the
coreference interface for appositive recognition, we eliminated from the training data all candidate pairs
involving locations because the location specialist should be identifying locations for us. If the coreference
classifier were operating in an ideal environment, it would never encounter unrecognized locations.
Unfortunately, as we saw with EJV 0592, the location specialist does not trap all the locations, and this led
to a bad coreference decision. In an earlier version of the coreference classifier we had trained it on imperfect
data containing unrecognized locations, but as the location specialist improved, we felt that the training for
the coreference classifier was falling increasingly out of sync with the rest of the system, so we updated it by
eliminating all the location instances. Then when the coreference classifier was confronted with an
unrecognized location, it failed to classify it correctly. When upstream system components are continually
evolving (as they were during our MUC-5 development cycle), it is difficult to synchronize downstream
dependencies in training data. A better system development cycle would stabilize upstream components
before training downstream components in order to maintain the best possible synchronization across
trainable components.

TTG was able to add some value to the output of CIRCUS and subsequent discourse processing. In
module tests, TTG typically added 6-12% of accuracy in identifying domain objects and relationships. That
added value is measured against picking the most likely class (yes or no) for a particular domain object
(e.g. JV-ENTITY or ME-LITHOGRAPHY) or relationship (e.g. JV-TIE-UP or
ME-MICROELECTRONICS-CAPABILITY). However, TTG fell far below our expectations for correctly
filtering and connecting the parser's output. We find two reasons for this shortfall. First, some small
deficit can be attributed to the system development cycle, since TTG sits at the end of the cycle of training
and testing various modules.

The second, and by far the dominant, effect comes from the combination of the training algorithm (ID3)
and the amount of data. As mentioned previously, there are two types of features used by TTG: (1) closed
class features (e.g. token type, semantic features, and CN patterns) and (2) open class features (i.e. CN trigger
words). Using open class features can be difficult, because most algorithms cannot detect reliable
discriminating features if there are too many features: reliable features cannot be separated from noise.
Using trigger words in conjunction with relations between memory tokens results in 3,000-5,000 binary features.
With no noise suppression added to the algorithm and given a large number of features, ID3 will create very
deep decision trees that classify stories in the training set based on noise.

We ran two sets of decision trees in deciding how to configure our system for the final test run:
MIN-TREES, using only closed class features and no noise suppression, and MAX-TREES, using closed class and
open class features and a noise suppression rule. The noise suppression was a termination condition on the
recursion of the ID3 algorithm. Recursion was terminated when all features resulted in creating a node that
classified examples from fewer than 10 different source texts. Using closed class features rarely resulted in a
terminal node that classified examples from fewer than 10 stories. In all tests the MAX-TREES performed
better. However, as a result of the noise suppression, no decision tree contained very many discriminations
on a trigger. The performance of the MAX-TREES indicated that individual words are good discriminators;
however, their scarcity in the decision trees indicates that we are not using the appropriate algorithm. We
believe that data-lean algorithms (such as explanation-based learning) in concert with shared knowledge
bases might be effective.
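
The noise suppression rule can be sketched on top of a simplified ID3 over binary features. This is not the TTG implementation: the example layout (`(feature_set, label, text_id)` triples), the function names, and the reading of the rule (a split is rejected if either resulting node covers examples from fewer than `min_texts` distinct source texts, and recursion stops when every split is rejected) are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(examples):
    """Shannon entropy of the labels in a list of examples."""
    counts = Counter(label for _, label, _ in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def distinct_texts(examples):
    """Number of different source texts contributing examples."""
    return len({text_id for _, _, text_id in examples})

def build_tree(examples, features, min_texts=10):
    """Simplified ID3; examples are (feature_set, label, text_id) triples."""
    positives = sum(1 for _, label, _ in examples if label)
    best = None
    for f in features:
        yes = [e for e in examples if f in e[0]]
        no = [e for e in examples if f not in e[0]]
        # Noise suppression: reject splits backed by too few source texts.
        if min(distinct_texts(yes), distinct_texts(no)) < min_texts:
            continue
        remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(examples)
        gain = entropy(examples) - remainder
        if gain > 0 and (best is None or gain > best[0]):
            best = (gain, f, yes, no)
    if best is None:
        # Terminal node: certainty is the fraction of positive examples.
        return {"certainty": positives / len(examples)}
    _, f, yes, no = best
    rest = [g for g in features if g != f]
    return {"feature": f,
            "present": build_tree(yes, rest, min_texts),
            "absent": build_tree(no, rest, min_texts)}

# Toy data: feature "a" is reliable across texts, "noise" occurs in one text only.
examples = [({"a"} if i < 10 else set(), i < 10, i) for i in range(20)]
examples[0] = ({"a", "noise"}, True, 0)
print(build_tree(examples, ["a", "noise"])["feature"])  # a
```

In the toy data the feature that occurs in only one text is never chosen, which mirrors the observation above that few trigger words survived noise suppression.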

In attributing performance to various components, we measured 25 random texts in EME. At the
memory token stage we found that CIRCUS had extracted string-fills and set-fills with a recall/precision of
68/54. However, our score output for those slots was 32/45 (measured only on the slots we attempted).
Even when the thresholds for TTG were lowered to 0.0, so that all output came through, the recall was not
anywhere near 68. Therefore it would appear that the difficult part of the template task is not finding good
things to put in the template, but figuring out how to split and merge objects. We do not (yet) have a trainable
component that handles splitting and merging decisions in general.
The EJV and EME systems that we tested in our official evaluation were in many ways incomplete
systems. Although our upstream components were operating reasonably well, additional feedback cycles
were badly needed for other components operating downstream. In particular, trainable coreference and
trainable template generation did not receive the time and attention they deserve. We are generally
encouraged by the success of our trainable components for part-of-speech tagging, dictionary generation,
noun phrase analysis, semantic feature tagging, and coreference based on appositive recognition. But we
encountered substantial difficulties with general coreference prior to template generation. This appears to be
the greatest challenge remaining for trainable components supporting information extraction. We know
from our earlier work in the domain of terrorism that coreference resolution can be reasonably well-managed
on the basis of hand-coded heuristics [Lehnert et al. 1992b]. But this type of solution does not port across
domains and therefore represents a significant system development bottleneck. True portability will only be
achieved with trainable coreference capabilities.

We believe that trainable discourse analysis was the major stumbling block standing between our
MUC-5 system and the performance levels attained by systems incorporating hand-coded discourse analysis.
We remain optimistic that state-of-the-art performance will be obtained by corpus-driven machine learning
techniques, but it is clear that more research is needed to meet this very important challenge. To facilitate
research in this area by other sites, UMass will make concept extraction training data (CIRCUS output) for
the full EJV and EME corpora available to research laboratories with internet access. When paired with
MUC-5 key templates available from the Linguistic Data Consortium, this data will allow a wide range of
researchers who may not be experts in natural language to tackle the challenge of trainable coreference and
template generation as problems in machine learning. We believe it is important for the NLP community
to encourage and support the involvement of a wider research community in our quest for practical
information extraction technologies.